Language individuation and marker words: Shakespeare and his Maxwell's demon

Marsden, John; Budden, David; Craig, Hugh; Moscato, Pablo

Title: Language individuation and marker words: Shakespeare and his Maxwell's demon
Creator: Marsden, John; Budden, David; Craig, Hugh; Moscato, Pablo
Relation: PLoS One Vol. 8, Issue 6
Publisher Link: http://dx.doi.org/10.1371/journal.pone.0066813
Publisher: Public Library of Science
Resource Type: journal article
Date: 2013
Description: Background: Within the structural and grammatical bounds of a common language, all authors develop their own distinctive writing styles. Whether the relative occurrence of common words can be measured to produce accurate models of authorship is of particular interest. This work introduces a new score that helps to highlight such variations in word occurrence, and is applied to produce models of authorship of a large group of plays from the Shakespearean era. Methodology: A text corpus containing 55,055 unique words was generated from 168 plays from the Shakespearean era (16th and 17th centuries) of undisputed authorship. A new score, CM1, is introduced to measure variation patterns based on the frequency of occurrence of each word for the authors John Fletcher, Ben Jonson, Thomas Middleton and William Shakespeare, compared to the rest of the authors in the study (which provides a reference of relative word usage at that time). A total of 50 WEKA methods were applied for Fletcher, Jonson and Middleton, to identify those which were able to produce models yielding over 90% classification accuracy. This ensemble of WEKA methods was then applied to model Shakespearean authorship across all 168 plays, yielding a Matthews' correlation coefficient (MCC) performance of over 90%. Furthermore, the best model yielded an MCC of 99%. Conclusions: Our results suggest that different authors, while adhering to the structural and grammatical bounds of a common language, develop measurably distinct styles by the tendency to over-utilise or avoid particular common words and phrasings. Considering language and the potential of words as an abstract chaotic system with a high entropy, similarities can be drawn to the Maxwell's Demon thought experiment; authors subconsciously favour or filter certain words, modifying the probability profile in ways that could reflect their individuality and style.
Subject: language; Shakespeare; Maxwell's Demon; data mining
Identifier: http://hdl.handle.net/1959.13/1048893
Identifier: uon:14966
Identifier: ISSN:1932-6203
Rights: © 2013 Marsden et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Language: eng
Full Text
Reviewed
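
The description above reports classifier performance as a Matthews correlation coefficient (MCC). As a reading aid, here is a minimal sketch of how MCC is computed for a binary task such as "by Shakespeare" versus "not by Shakespeare". The function name `mcc` and the confusion-matrix counts in the usage line are illustrative assumptions, not figures from the article. The record does not define the CM1 score, so it is not reproduced here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]: 1 is perfect agreement, 0 is no better
    than chance, -1 is total disagreement.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: report 0 when any marginal total is zero.
        return 0.0
    return numerator / denominator


# Hypothetical counts for a 168-play corpus (made up for illustration only).
print(round(mcc(tp=30, tn=130, fp=4, fn=4), 3))
```

MCC is conventionally expressed on a scale from -1 to 1, so the "over 90%" quoted in the description presumably corresponds to a coefficient above 0.9.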

